-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AVX2 implementation of first_max_element
#2806
base: master
Are you sure you want to change the base?
Conversation
248542a
to
28cb0a9
Compare
28cb0a9
to
db76e3c
Compare
Resulting assembly:
|
It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too? |
I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit. As far as I understand, this is how the algorithm LLVM uses works: max:
; rdi is a pointer to the first element, and it is statically known that there are 8 elements
; load the first 4 elements of the slice into xmm0
vmovdqu xmm0, xmmword ptr [rdi]
; do a packed max the last 4 elements of the slice (rdi + 16)
; basically, xmm0 now looks like this (where x represents the original 8-element array):
; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
;
; at this point, we can forget about the entire array and just worry about finding the max of
; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
; shuffle: xmm1 = xmm0[2,3,2,3]
vpshufd xmm1, xmm0, 238
; packed max of xmm0 and xmm1, store the result in xmm0
; here is what the registers look like before this step:
; xmm0 = [x[0], x[1], x[2], x[3]]
; xmm1 = [x[2], x[3], x[2], x[3]]
; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
; and as you can see, the max will be either the first or second element of xmm0
vpmaxsd xmm0, xmm0, xmm1
; shuffle : xmm1 = xmm0[1,1,1,1]
; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
; (other elements omitted since we don't care about them any more)
vpshufd xmm1, xmm0, 85
; packed max of xmm0 and xmm1, store the result in xmm0
vpmaxsd xmm0, xmm0, xmm1
; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]
; so move the first element of xmm0, which is the final max, into eax
vmovd eax, xmm0
ret |
Ah yeah sorry, yeah the first max comparison can only be 4 vs 4. |
db76e3c
to
47ad5a4
Compare
This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.
47ad5a4
to
cff1af9
Compare
I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195 |
@KyleSiefring I think you're right, I didn't notice this. If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here: Lines 1773 to 1785 in 2ec4e67
However, I think there could still be at least some value for making |
Could you please split the scalar Your simd code would be good in the not-nasm situation but maybe it would need further discussion. |
Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet? |
We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms. |
This also now requires BMI1 and BMI2 for AVX2 in
CpuFeatureLevel
(for tzcnt).